This document presents a detailed analysis forecasting expected sales of high-end vacuums for the Vac-Attack company in December 2020, based on historical data. The aim is to determine whether the company will meet its sales target and to ensure enough stock is on hand to match demand.
The report covers data cleaning, summary statistics, visualisation, and statistical model fitting, implemented in the R programming language.
if (!require(tidyverse))
install.packages("tidyverse")
if (!require(lubridate))
install.packages("lubridate")
if (!require(plotly))
install.packages("plotly")
if (!require(car))
install.packages("car")
if (!require(stargazer))
install.packages("stargazer")
if (!require(xgboost))
install.packages("xgboost")
library(tidyverse)
library(lubridate)
library(stargazer)
library(car)
library(xgboost)
First, import the marketing data.
market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)
market_sale_raw <- readr::read_csv("Data/MarketingSales.csv",
col_names = market_sale_col$X1)
Then import the December data for prediction.
dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)
dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv",
col_names = dec_col$X1)
A quick exploration of the imported data sets:
dim(market_sale_raw)
## [1] 1796 11
glimpse(market_sale_raw)
## Rows: 1,796
## Columns: 11
## $ Date <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month <chr> "January", "January", "January", "January", "Ja…
## $ Day <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)
## [1] 31 4
glimpse(dec_ad_raw)
## Rows: 31
## Columns: 4
## $ Date <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month <chr> "December", "December", "December", "December", "Dece…
## $ Day <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…
The next step is to tidy up the data sets for visualisation and
modelling. I have also included year, month and day variables in both
factor and numeric formats.
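The numeric day-of-week column relies on the explicit factor level order, so Monday maps to 1 and Sunday to 7 regardless of the order in which values appear in the data. A minimal illustration of that encoding:

```r
# as.numeric() on a factor returns the index of each value's level,
# so an explicit level order fixes the Monday = 1 ... Sunday = 7 coding.
days <- factor(c("Sunday", "Monday", "Friday"),
               levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                          "Friday", "Saturday", "Sunday"))
as.numeric(days) # 7 1 5
```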
market_sale_final <-
market_sale_raw %>%
mutate(Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(`0508Line_247`),
Positive_News = factor(PositiveNews),
Negative_News = factor(NegativeCoverage),
Competition = factor(Competition),
Ultra_Edition = factor( UltraEdition_Available),
COVID_Lockdown = factor(COVID_Lockdown),
Year = factor(year(Date)),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Month_num = as.numeric(Month),
Year_num = year(Date),
Day_num = as.numeric(Day),
Date_num = date(Date)) %>%
select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition, COVID_Lockdown,
Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(market_sale_final)
## [1] 1796 16
glimpse(market_sale_final)
## Rows: 1,796
## Columns: 16
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month <fct> January, January, January, January, January, January, J…
## $ Day <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
dec_ad_final <-
dec_ad_raw %>%
mutate(
Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(0, levels = c(0, 1)),
Positive_News = factor(0, levels = c(0, 1)),
Negative_News = factor(0, levels = c(0, 1)),
Competition = factor(0, levels = c(0, 1)),
Ultra_Edition = factor(1, levels = c(0, 1)),
COVID_Lockdown = factor(0, levels = c(0, 1)),
Year = factor(2020, levels = market_sale_final$Year |> levels() ),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Year_num = 2020,
Month_num = 12,
Day_num = as.numeric(Day),
Date_num = date(Date)
) %>%
select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition,
COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(dec_ad_final)
## [1] 31 15
glimpse(dec_ad_final)
## Rows: 31
## Columns: 15
## $ Ad_Spend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month <fct> December, December, December, December, December, Decem…
## $ Day <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
I will first generate some plots to look for relationships.
The first plot shows total units sold over time.
ggplot(market_sale_final, aes(x = Date, y = Sales)) +
geom_path() +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()
There is clearly some seasonality in the total units sold over time, which the modelling will need to account for.
The second plot examines the relationship between total units sold and total advertising spend.
ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
geom_point()
There is a weak positive correlation between advertising spend and total units sold, with a Pearson correlation of 41.1%.
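The quoted correlation comes from `cor()` on the two columns. Since the report's CSV files are not bundled here, the sketch below illustrates the same call on simulated data (the variable names and numbers are made up):

```r
# Pearson correlation between advertising spend and units sold,
# illustrated with simulated data in place of market_sale_final.
set.seed(42)
ad_spend <- runif(100, 0, 20000)
sales <- 60 + 0.001 * ad_spend + rnorm(100, sd = 8)
cor(ad_spend, sales, method = "pearson")
```

On the real data the equivalent call is `cor(market_sale_final$Ad_Spend, market_sale_final$Sales)`.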
ggplot(market_sale_final, aes(x = Date)) +
geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) +
geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) +
scale_y_continuous(
name = "The units sold",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light() +
theme(
axis.title.y = element_text(color = "red", size=13),
axis.title.y.right = element_text(color = "blue", size=13)
)
hist(market_sale_final$Sales)
The overall distribution of sales looks reasonably normal, so there is no need to apply any additional normalisation.
Total units sold versus years.
market_sale_final %>%
group_by(Year) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()
| Year | Sales_mean | Sales_sum |
|---|---|---|
| 2016 | 89.48087 | 32750 |
| 2017 | 82.02740 | 29940 |
| 2018 | 76.81370 | 28037 |
| 2019 | 78.07671 | 28498 |
| 2020 | 66.04179 | 22124 |
The total units sold appears to decline from year to year.
Total units sold versus months
market_sale_final %>%
group_by(Month) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Month | Sales_mean | Sales_sum |
|---|---|---|
| January | 66.17419 | 10257 |
| February | 65.78169 | 9341 |
| March | 64.52258 | 10001 |
| April | 59.66000 | 8949 |
| May | 65.76129 | 10193 |
| June | 71.71333 | 10757 |
| July | 71.52258 | 11086 |
| August | 80.06452 | 12410 |
| September | 88.28000 | 13242 |
| October | 91.63226 | 14203 |
| November | 107.91333 | 16187 |
| December | 118.73387 | 14723 |
Most units are sold in the later months of the year.
Total units sold versus days of the week.
market_sale_final %>%
group_by(Day) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Day | Sales_mean | Sales_sum |
|---|---|---|
| Monday | 78.26848 | 20115 |
| Tuesday | 78.85156 | 20186 |
| Wednesday | 77.59766 | 19865 |
| Thursday | 78.82422 | 20179 |
| Friday | 79.33852 | 20390 |
| Saturday | 78.82101 | 20257 |
| Sunday | 79.21012 | 20357 |
The total units sold appears to be very similar across the days of the week.
market_sale_final %>%
group_by(Ultra_Edition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Ultra_Edition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 81.28650 | 74052 |
| 1 | 76.04181 | 67297 |
market_sale_final %>%
group_by(Year, Ultra_Edition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Year | Ultra_Edition | Sales_mean | Sales_sum |
|---|---|---|---|
| 2016 | 0 | 89.48087 | 32750 |
| 2017 | 0 | 82.02740 | 29940 |
| 2018 | 0 | 63.12222 | 11362 |
| 2018 | 1 | 90.13514 | 16675 |
| 2019 | 1 | 78.07671 | 28498 |
| 2020 | 1 | 66.04179 | 22124 |
market_sale_final %>%
group_by(Positive_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Positive_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.22094 | 135948 |
| 1 | 93.12069 | 5401 |
market_sale_final %>%
group_by(Negative_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()
| Negative_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.85618 | 140364 |
| 1 | 61.56250 | 985 |
market_sale_final %>%
group_by(Positive_News, Negative_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
## `summarise()` has grouped output by 'Positive_News'. You can override using the
## `.groups` argument.
| Positive_News | Negative_News | Sales_mean | Sales_sum |
|---|---|---|---|
| 0 | 0 | 78.37573 | 134963 |
| 0 | 1 | 61.56250 | 985 |
| 1 | 0 | 93.12069 | 5401 |
market_sale_final %>%
group_by(Competition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Competition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.09143 | 111247 |
| 1 | 73.96069 | 30102 |
market_sale_final %>%
group_by(COVID_Lockdown) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| COVID_Lockdown | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.24484 | 139947 |
| 1 | 26.96154 | 1402 |
market_sale_final %>%
group_by(Phone_24) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()
| Phone_24 | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 76.10649 | 95057 |
| 1 | 84.62888 | 46292 |
Since the distribution of the total units sold is roughly normal, with a bell-shaped curve, I will start by fitting a multiple linear regression model for the total units sold, as seen below. Multiple linear regression is a good starting point because the model is straightforward to fit, and its diagnostics indicate whether a more complicated modelling technique is required. Linear regression also makes inference easy, i.e. interpreting which predictor variables have statistically significant effects on, or the most influence over, the target variable. Lastly, the fitted model can be used to predict the target variable from the values of the predictor variables.
linear_reg_fit <-
lm(
Sales ~ Ad_Spend + (Year * Month) / Day +
COVID_Lockdown + Ultra_Edition + Phone_24 +
Positive_News + Negative_News + Competition,
data = market_sale_final
)
ANOVA table of the initial model
anova(linear_reg_fit)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Ad_Spend 1 151427 151427 5321.5665 < 2.2e-16 ***
## Year 4 94608 23652 831.1967 < 2.2e-16 ***
## Month 11 517817 47074 1654.3260 < 2.2e-16 ***
## COVID_Lockdown 1 45137 45137 1586.2609 < 2.2e-16 ***
## Ultra_Edition 1 301 301 10.5642 0.001181 **
## Phone_24 1 10715 10715 376.5428 < 2.2e-16 ***
## Positive_News 1 9672 9672 339.8899 < 2.2e-16 ***
## Negative_News 1 7108 7108 249.8029 < 2.2e-16 ***
## Competition 1 109 109 3.8149 0.051001 .
## Year:Month 42 13192 314 11.0382 < 2.2e-16 ***
## Year:Month:Day 354 8297 23 0.8236 0.987476
## Residuals 1377 39183 28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Choose a model by AIC in a Stepwise Algorithm
linear_reg_fit_final <- step(linear_reg_fit)
## Start: AIC=6374.49
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Competition
##
##
## Step: AIC=6374.49
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month +
## Year:Month:Day
##
## Df Sum of Sq RSS AIC
## - Year:Month:Day 354 8297 47479 6011.4
## - Ultra_Edition 1 21 39204 6373.5
## - Competition 1 39 39222 6374.3
## <none> 39183 6374.5
## - Negative_News 1 6008 45191 6628.7
## - Positive_News 1 7675 46858 6693.7
## - COVID_Lockdown 1 15661 54844 6976.4
## - Ad_Spend 1 114379 153562 8825.6
##
## Step: AIC=6011.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month
##
## Df Sum of Sq RSS AIC
## - Ultra_Edition 1 23 47502 6010.3
## - Competition 1 30 47510 6010.6
## <none> 47479 6011.4
## - Negative_News 1 8015 55494 6289.6
## - Positive_News 1 9093 56572 6324.1
## - COVID_Lockdown 1 15965 63444 6530.0
## - Year:Month 43 22732 70212 6628.0
## - Ad_Spend 1 135383 182862 8431.2
##
## Step: AIC=6010.29
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Competition + Year:Month
##
## Df Sum of Sq RSS AIC
## - Competition 1 30 47532 6009.4
## <none> 47502 6010.3
## - Negative_News 1 8015 55517 6288.3
## - Positive_News 1 9085 56588 6322.6
## - COVID_Lockdown 1 15965 63467 6528.7
## - Year:Month 43 23191 70694 6638.3
## - Ad_Spend 1 135379 182881 8429.4
##
## Step: AIC=6009.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Year:Month
##
## Df Sum of Sq RSS AIC
## <none> 47532 6009.4
## - Negative_News 1 8015 55547 6287.3
## - Positive_News 1 9081 56614 6321.4
## - COVID_Lockdown 1 15964 63497 6527.5
## - Year:Month 43 23830 71363 6653.3
## - Ad_Spend 1 135661 183193 8430.5
ANOVA table of the final model
anova(linear_reg_fit_final)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Ad_Spend 1 151427 151427 5520.900 < 2.2e-16 ***
## Year 4 94608 23652 862.331 < 2.2e-16 ***
## Month 11 517817 47074 1716.293 < 2.2e-16 ***
## COVID_Lockdown 1 45137 45137 1645.678 < 2.2e-16 ***
## Positive_News 1 9692 9692 353.375 < 2.2e-16 ***
## Negative_News 1 7520 7520 274.174 < 2.2e-16 ***
## Year:Month 43 23830 554 20.205 < 2.2e-16 ***
## Residuals 1733 47532 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Parameter estimates
summary(linear_reg_fit_final)
##
## Call:
## lm(formula = Sales ~ Ad_Spend + Year + Month + COVID_Lockdown +
## Positive_News + Negative_News + Year:Month, data = market_sale_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6621 -3.6102 -0.2807 3.3290 27.3765
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.565e+01 9.609e-01 68.324 < 2e-16 ***
## Ad_Spend 1.416e-03 2.014e-05 70.328 < 2e-16 ***
## Year2017 1.754e+00 1.331e+00 1.318 0.187765
## Year2018 -1.706e+01 1.331e+00 -12.817 < 2e-16 ***
## Year2019 -2.031e+01 1.331e+00 -15.255 < 2e-16 ***
## Year2020 -2.133e+01 1.331e+00 -16.025 < 2e-16 ***
## MonthFebruary -1.076e+00 1.355e+00 -0.794 0.427270
## MonthMarch 2.185e-01 1.331e+00 0.164 0.869649
## MonthApril -1.851e+00 1.342e+00 -1.379 0.168043
## MonthMay -2.427e+00 1.331e+00 -1.824 0.068390 .
## MonthJune 1.771e+00 1.342e+00 1.320 0.186943
## MonthJuly 9.950e-01 1.330e+00 0.748 0.454593
## MonthAugust 9.419e+00 1.332e+00 7.069 2.26e-12 ***
## MonthSeptember 1.897e+01 1.341e+00 14.143 < 2e-16 ***
## MonthOctober 2.745e+01 1.332e+00 20.605 < 2e-16 ***
## MonthNovember 4.178e+01 1.342e+00 31.126 < 2e-16 ***
## MonthDecember 4.905e+01 1.331e+00 36.860 < 2e-16 ***
## COVID_Lockdown1 -3.483e+01 1.444e+00 -24.125 < 2e-16 ***
## Positive_News1 1.290e+01 7.090e-01 18.196 < 2e-16 ***
## Negative_News1 -2.277e+01 1.332e+00 -17.094 < 2e-16 ***
## Year2017:MonthFebruary -1.025e+01 1.926e+00 -5.323 1.16e-07 ***
## Year2018:MonthFebruary 7.149e+00 1.924e+00 3.715 0.000210 ***
## Year2019:MonthFebruary 3.171e+00 1.924e+00 1.649 0.099408 .
## Year2020:MonthFebruary 5.005e+00 1.914e+00 2.615 0.008990 **
## Year2017:MonthMarch -3.546e+00 1.883e+00 -1.883 0.059855 .
## Year2018:MonthMarch 7.306e-01 1.883e+00 0.388 0.698001
## Year2019:MonthMarch 2.308e+00 1.882e+00 1.226 0.220208
## Year2020:MonthMarch 5.374e+00 1.910e+00 2.813 0.004961 **
## Year2017:MonthApril -6.831e+00 1.898e+00 -3.599 0.000329 ***
## Year2018:MonthApril 4.347e+00 1.898e+00 2.291 0.022106 *
## Year2019:MonthApril 8.629e+00 1.898e+00 4.546 5.86e-06 ***
## Year2020:MonthApril 8.313e+00 2.385e+00 3.486 0.000502 ***
## Year2017:MonthMay 1.501e+00 1.882e+00 0.797 0.425282
## Year2018:MonthMay 6.937e+00 1.882e+00 3.686 0.000235 ***
## Year2019:MonthMay 1.105e+01 1.882e+00 5.870 5.22e-09 ***
## Year2020:MonthMay 1.115e+01 2.007e+00 5.555 3.21e-08 ***
## Year2017:MonthJune -4.719e+00 1.898e+00 -2.487 0.012975 *
## Year2018:MonthJune 3.308e+00 1.897e+00 1.743 0.081469 .
## Year2019:MonthJune 1.280e+01 1.900e+00 6.740 2.14e-11 ***
## Year2020:MonthJune 8.612e+00 1.898e+00 4.538 6.06e-06 ***
## Year2017:MonthJuly -1.327e+01 1.882e+00 -7.050 2.57e-12 ***
## Year2018:MonthJuly 8.134e+00 1.881e+00 4.324 1.62e-05 ***
## Year2019:MonthJuly 1.871e+01 1.881e+00 9.947 < 2e-16 ***
## Year2020:MonthJuly 8.028e+00 1.882e+00 4.265 2.11e-05 ***
## Year2017:MonthAugust -9.244e+00 1.885e+00 -4.904 1.02e-06 ***
## Year2018:MonthAugust 5.231e+00 1.884e+00 2.776 0.005563 **
## Year2019:MonthAugust 1.030e+01 1.883e+00 5.472 5.09e-08 ***
## Year2020:MonthAugust 1.518e+01 1.885e+00 8.052 1.50e-15 ***
## Year2017:MonthSeptember -1.093e+01 1.897e+00 -5.760 9.95e-09 ***
## Year2018:MonthSeptember 1.773e+00 1.897e+00 0.935 0.350071
## Year2019:MonthSeptember 1.298e+01 1.898e+00 6.839 1.10e-11 ***
## Year2020:MonthSeptember 1.043e+01 1.897e+00 5.496 4.46e-08 ***
## Year2017:MonthOctober -1.429e+01 1.884e+00 -7.584 5.43e-14 ***
## Year2018:MonthOctober 2.755e+00 1.883e+00 1.463 0.143639
## Year2019:MonthOctober 5.206e+00 1.885e+00 2.762 0.005803 **
## Year2020:MonthOctober 8.350e+00 1.882e+00 4.437 9.69e-06 ***
## Year2017:MonthNovember -1.919e+01 1.897e+00 -10.118 < 2e-16 ***
## Year2018:MonthNovember 7.326e+00 1.898e+00 3.860 0.000117 ***
## Year2019:MonthNovember 1.230e+01 1.898e+00 6.482 1.18e-10 ***
## Year2020:MonthNovember 4.986e+00 1.899e+00 2.626 0.008728 **
## Year2017:MonthDecember -1.355e+01 1.882e+00 -7.202 8.83e-13 ***
## Year2018:MonthDecember 6.887e+00 1.884e+00 3.656 0.000264 ***
## Year2019:MonthDecember 9.716e+00 1.882e+00 5.161 2.73e-07 ***
## Year2020:MonthDecember NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.237 on 1733 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9451
## F-statistic: 499.9 on 62 and 1733 DF, p-value: < 2.2e-16
sum_stats <- summary(linear_reg_fit_final)
Note that this multiple linear regression model has an adjusted R-squared of 0.9451, meaning it explains about 94.51% of the variance in sales, which indicates a very good fit.
hist(linear_reg_fit_final$residuals)
plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final))
# calculate the mean squared error: mean of the squared residuals
mse_linear <-
mean((market_sale_final$Sales -
predict(linear_reg_fit_final,
newdata = market_sale_final)) ^ 2)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
The AIC from the linear regression model is approximately 11108. The histogram of the residuals is relatively normal, and the residual plot is randomly scattered around zero.
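As a sanity check on the error metric: the mean squared error squares each residual before averaging, so `sum(y - yhat)^2` (squaring the sum) is a different quantity and collapses towards zero whenever residuals roughly cancel. A toy numeric illustration:

```r
# MSE = mean of squared residuals; hand-checkable toy numbers.
actual <- c(10, 12, 14)
predicted <- c(12, 11, 15)
mean((actual - predicted)^2) # (4 + 1 + 1) / 3 = 2
sum(actual - predicted)^2 # (-2 + 1 - 1)^2 = 4, a different quantity
```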
Poisson regression is often used for modelling count data. Below are the results of a Poisson regression analysis.
poisson_reg_fit <-
glm(
Sales ~ Ad_Spend + (Year * Month) / Day +
COVID_Lockdown + Ultra_Edition + Phone_24 +
Positive_News + Negative_News + Competition,
data = market_sale_final,
family = poisson()
)
ANOVA table of the initial model
car::Anova(poisson_reg_fit)
## Analysis of Deviance Table (Type II tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1364.9 1 < 2.2e-16 ***
## Year 15.4 4 0.003935 **
## Month 3501.8 11 < 2.2e-16 ***
## COVID_Lockdown 356.0 1 < 2.2e-16 ***
## Ultra_Edition 0.4 1 0.506484
## Phone_24 0
## Positive_News 88.1 1 < 2.2e-16 ***
## Negative_News 80.1 1 < 2.2e-16 ***
## Competition 0.3 1 0.594721
## Year:Month 216.2 42 < 2.2e-16 ***
## Year:Month:Day 131.5 354 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Choose a model by AIC in a Stepwise Algorithm
poisson_reg_fit_final <- step(poisson_reg_fit)
## Start: AIC=12543.89
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Competition
##
##
## Step: AIC=12543.89
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month +
## Year:Month:Day
##
## Df Deviance AIC
## - Year:Month:Day 354 772.78 11967
## - Competition 1 641.53 12542
## - Ultra_Edition 1 641.69 12542
## <none> 641.25 12544
## - Negative_News 1 721.35 12622
## - Positive_News 1 729.31 12630
## - COVID_Lockdown 1 997.21 12898
## - Ad_Spend 1 2006.15 13907
##
## Step: AIC=11967.41
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month
##
## Df Deviance AIC
## - Competition 1 772.98 11966
## - Ultra_Edition 1 773.15 11966
## <none> 772.78 11967
## - Negative_News 1 875.60 12068
## - Positive_News 1 877.81 12070
## - Year:Month 43 1106.23 12215
## - COVID_Lockdown 1 1130.86 12324
## - Ad_Spend 1 2394.15 13587
##
## Step: AIC=11965.61
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Year:Month
##
## Df Deviance AIC
## - Ultra_Edition 1 773.35 11964
## <none> 772.98 11966
## - Negative_News 1 875.80 12066
## - Positive_News 1 877.97 12069
## - Year:Month 43 1122.66 12229
## - COVID_Lockdown 1 1131.06 12322
## - Ad_Spend 1 2397.68 13588
##
## Step: AIC=11963.99
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Year:Month
##
## Df Deviance AIC
## <none> 773.35 11964
## - Negative_News 1 876.17 12065
## - Positive_News 1 878.24 12067
## - Year:Month 43 1135.63 12240
## - COVID_Lockdown 1 1131.43 12320
## - Ad_Spend 1 2397.95 13587
ANOVA table of the final model
car::Anova(poisson_reg_fit_final)
## Analysis of Deviance Table (Type II tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1624.6 1 < 2.2e-16 ***
## Year 537.0 4 < 2.2e-16 ***
## Month 5536.1 11 < 2.2e-16 ***
## COVID_Lockdown 358.1 1 < 2.2e-16 ***
## Positive_News 104.9 1 < 2.2e-16 ***
## Negative_News 102.8 1 < 2.2e-16 ***
## Year:Month 362.3 43 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hist(poisson_reg_fit$residuals)
plot(fitted(poisson_reg_fit), resid(poisson_reg_fit))
# calculate the mean squared error: mean of the squared residuals
mse_poisson <-
mean((market_sale_final$Sales -
predict(poisson_reg_fit_final,
newdata = market_sale_final,
type = "response")) ^ 2)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
The AIC from the Poisson regression model (approximately 11964) is slightly worse than that of the linear regression above.
Finally, I will also fit an XGBoost model to see whether an even smaller mean squared error can be obtained, to ensure the total sales prediction is accurate and robust. When preparing the data for model training below, I did not split the data into training and testing sets. This is because the two models above both compute the mean squared error on the training data only, so fitting on the same data keeps the mean squared errors comparable across models.
market_sale_final_mat <-
market_sale_final %>%
select(Sales, Ad_Spend, Phone_24, Positive_News, Negative_News, Competition,
Ultra_Edition, COVID_Lockdown, Year_num, Month_num, Day_num) %>%
mutate(Year_month = paste0(Year_num, Month_num)) %>%
as.matrix()
market_sale_final_mat <-
apply(market_sale_final_mat, 2, as.numeric)
train <- market_sale_final_mat
# create DMatrix objects for training and test sets
dtrain <- xgb.DMatrix(data = train[,-1], label = train[, "Sales"])
# specify the model parameters
params <- list(booster = "gbtree", objective = "reg:squarederror",
eta = 0.3, max_depth = 3,subsample = 0.8,
colsample_bytree = 0.8, nthread = 2)
# hyperparameter grid (a full grid search would loop xgb.cv over these values)
gridsearch_params <- list(max_depth = c(2, 4, 6), min_child_weight = c(1, 3, 5))
set.seed(1)
cv_result <-
xgb.cv(params = params, data = dtrain, nfold = 5, nrounds = 100,
early_stopping_rounds = 10, metrics = "rmse",
verbose = TRUE, maximize = FALSE)
## [1] train-rmse:57.852969+0.476198 test-rmse:57.808562+0.418858
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:41.295639+0.309401 test-rmse:41.291347+0.505362
## [3] train-rmse:29.748447+0.228122 test-rmse:29.765895+0.475301
## [4] train-rmse:21.845696+0.196288 test-rmse:21.978830+0.429040
## [5] train-rmse:16.445210+0.128355 test-rmse:16.635187+0.368547
## [6] train-rmse:12.747118+0.158243 test-rmse:13.047995+0.345544
## [7] train-rmse:10.404637+0.179678 test-rmse:10.741460+0.333321
## [8] train-rmse:8.843113+0.111802 test-rmse:9.243221+0.306017
## [9] train-rmse:7.797505+0.063773 test-rmse:8.272533+0.278479
## [10] train-rmse:7.168529+0.099072 test-rmse:7.700926+0.239920
## [11] train-rmse:6.779529+0.095536 test-rmse:7.327917+0.220856
## [12] train-rmse:6.484708+0.106734 test-rmse:7.035060+0.215553
## [13] train-rmse:6.278815+0.121913 test-rmse:6.824892+0.214558
## [14] train-rmse:6.121293+0.103468 test-rmse:6.684477+0.217459
## [15] train-rmse:6.008845+0.105604 test-rmse:6.587081+0.230001
## [16] train-rmse:5.914447+0.095200 test-rmse:6.483416+0.248313
## [17] train-rmse:5.831796+0.083372 test-rmse:6.413739+0.266887
## [18] train-rmse:5.762734+0.078059 test-rmse:6.362844+0.254265
## [19] train-rmse:5.709274+0.075766 test-rmse:6.314493+0.259974
## [20] train-rmse:5.659984+0.076543 test-rmse:6.268622+0.250437
## [21] train-rmse:5.614794+0.069840 test-rmse:6.238095+0.251149
## [22] train-rmse:5.568781+0.088766 test-rmse:6.213908+0.238872
## [23] train-rmse:5.529566+0.089851 test-rmse:6.177584+0.234247
## [24] train-rmse:5.487238+0.079895 test-rmse:6.129675+0.236513
## [25] train-rmse:5.457495+0.075045 test-rmse:6.120602+0.241391
## [26] train-rmse:5.420269+0.072538 test-rmse:6.084610+0.250495
## [27] train-rmse:5.394824+0.072564 test-rmse:6.074964+0.247039
## [28] train-rmse:5.376932+0.069403 test-rmse:6.060184+0.242153
## [29] train-rmse:5.351277+0.072629 test-rmse:6.055778+0.246861
## [30] train-rmse:5.319740+0.067331 test-rmse:6.030624+0.250858
## [31] train-rmse:5.298846+0.065196 test-rmse:6.021959+0.251812
## [32] train-rmse:5.275055+0.058685 test-rmse:5.992801+0.249521
## [33] train-rmse:5.256347+0.061291 test-rmse:5.986317+0.242775
## [34] train-rmse:5.234787+0.059736 test-rmse:5.971700+0.244410
## [35] train-rmse:5.209314+0.057399 test-rmse:5.957965+0.237388
## [36] train-rmse:5.193078+0.052812 test-rmse:5.957135+0.237159
## [37] train-rmse:5.166185+0.052154 test-rmse:5.947602+0.233346
## [38] train-rmse:5.151848+0.054320 test-rmse:5.948301+0.234458
## [39] train-rmse:5.135561+0.052012 test-rmse:5.946514+0.228293
## [40] train-rmse:5.121800+0.052016 test-rmse:5.948655+0.233832
## [41] train-rmse:5.104253+0.048436 test-rmse:5.929761+0.236392
## [42] train-rmse:5.093572+0.047828 test-rmse:5.928085+0.232895
## [43] train-rmse:5.082788+0.047241 test-rmse:5.926136+0.235880
## [44] train-rmse:5.066347+0.047106 test-rmse:5.928291+0.246709
## [45] train-rmse:5.053846+0.043688 test-rmse:5.923201+0.246024
## [46] train-rmse:5.037406+0.045706 test-rmse:5.921876+0.247585
## [47] train-rmse:5.019601+0.047767 test-rmse:5.910711+0.234497
## [48] train-rmse:5.000153+0.051261 test-rmse:5.917173+0.226700
## [49] train-rmse:4.983910+0.053864 test-rmse:5.904486+0.215948
## [50] train-rmse:4.967404+0.052445 test-rmse:5.895665+0.210841
## [51] train-rmse:4.958440+0.049448 test-rmse:5.895893+0.214413
## [52] train-rmse:4.942164+0.051068 test-rmse:5.898594+0.201320
## [53] train-rmse:4.929109+0.049666 test-rmse:5.905408+0.196692
## [54] train-rmse:4.914176+0.049694 test-rmse:5.907152+0.207105
## [55] train-rmse:4.897802+0.047627 test-rmse:5.890759+0.212987
## [56] train-rmse:4.887218+0.045709 test-rmse:5.887277+0.219786
## [57] train-rmse:4.875469+0.044760 test-rmse:5.888478+0.220310
## [58] train-rmse:4.860128+0.048115 test-rmse:5.889477+0.234508
## [59] train-rmse:4.851516+0.047451 test-rmse:5.884766+0.236204
## [60] train-rmse:4.839195+0.043259 test-rmse:5.882277+0.231871
## [61] train-rmse:4.827321+0.042784 test-rmse:5.884144+0.237330
## [62] train-rmse:4.818399+0.044790 test-rmse:5.893013+0.228989
## [63] train-rmse:4.808529+0.042725 test-rmse:5.896339+0.228086
## [64] train-rmse:4.793826+0.037079 test-rmse:5.895039+0.230940
## [65] train-rmse:4.782079+0.036894 test-rmse:5.894319+0.234007
## [66] train-rmse:4.769924+0.037295 test-rmse:5.890745+0.227343
## [67] train-rmse:4.757866+0.041325 test-rmse:5.885826+0.219757
## [68] train-rmse:4.745465+0.036225 test-rmse:5.879309+0.217778
## [69] train-rmse:4.732925+0.034829 test-rmse:5.876815+0.217111
## [70] train-rmse:4.719882+0.033528 test-rmse:5.877755+0.218805
## [71] train-rmse:4.711788+0.030563 test-rmse:5.874746+0.212870
## [72] train-rmse:4.695824+0.029547 test-rmse:5.870732+0.211965
## [73] train-rmse:4.686349+0.031026 test-rmse:5.871659+0.218814
## [74] train-rmse:4.676199+0.029560 test-rmse:5.871074+0.221318
## [75] train-rmse:4.667050+0.030636 test-rmse:5.872532+0.219028
## [76] train-rmse:4.656570+0.030979 test-rmse:5.874747+0.213028
## [77] train-rmse:4.647919+0.031589 test-rmse:5.874546+0.206055
## [78] train-rmse:4.638765+0.029053 test-rmse:5.875564+0.201671
## [79] train-rmse:4.626832+0.029337 test-rmse:5.868786+0.198773
## [80] train-rmse:4.618170+0.028152 test-rmse:5.864057+0.201775
## [81] train-rmse:4.602715+0.027356 test-rmse:5.867879+0.208803
## [82] train-rmse:4.591082+0.026813 test-rmse:5.868856+0.208424
## [83] train-rmse:4.579448+0.026370 test-rmse:5.873090+0.201267
## [84] train-rmse:4.569218+0.027652 test-rmse:5.872276+0.197501
## [85] train-rmse:4.562559+0.026343 test-rmse:5.872017+0.200561
## [86] train-rmse:4.554654+0.026723 test-rmse:5.864397+0.202739
## [87] train-rmse:4.543161+0.026600 test-rmse:5.866221+0.206257
## [88] train-rmse:4.533439+0.022644 test-rmse:5.867745+0.202166
## [89] train-rmse:4.527212+0.020910 test-rmse:5.866962+0.199708
## [90] train-rmse:4.519951+0.020538 test-rmse:5.868707+0.194761
## Stopping. Best iteration:
## [80] train-rmse:4.618170+0.028152 test-rmse:5.864057+0.201775
# update the model parameters with the optimal values
params$max_depth <- cv_result$best_param["max_depth"]
params$min_child_weight <- cv_result$best_param["min_child_weight"]
# train the final model using the optimal parameters and the
# best number of rounds found by cross-validation (iteration 80 above)
xgboostModel <- xgb.train(params, dtrain, nrounds = 80)
Feature importance of the model:
xgb.importance(colnames(train[,-1]), model = xgboostModel)
##            Feature         Gain       Cover Frequency
## 1: Month_num 0.6336764121 0.266527793 0.17692308
## 2: Ad_Spend 0.1868337666 0.331268501 0.39230769
## 3: COVID_Lockdown 0.1047144240 0.078431641 0.05769231
## 4: Year_num 0.0273856721 0.059519241 0.04230769
## 5: Phone_24 0.0161178323 0.029560903 0.02692308
## 6: Year_month 0.0127830274 0.086654424 0.12692308
## 7: Positive_News 0.0086914814 0.064768118 0.03076923
## 8: Negative_News 0.0072239186 0.046677996 0.07692308
## 9: Competition 0.0011799283 0.018556079 0.02307692
## 10: Ultra_Edition 0.0009779711 0.011128166 0.01538462
## 11: Day_num 0.0004155663 0.006907137 0.03076923
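Month_num and Ad_Spend dominate the Gain column, together accounting for over 80% of the model's predictive contribution. The importance table can also be visualised directly; a minimal sketch, assuming the xgboostModel and train objects above:

```r
# store the importance table, then plot the top features ranked by Gain
importance_matrix <- xgb.importance(colnames(train[, -1]), model = xgboostModel)
xgb.plot.importance(importance_matrix, top_n = 5, measure = "Gain")
```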
predictions <- predict(xgboostModel, dtrain)
# calculate the root mean squared error of the predictions
rmse_xgboost <- sqrt(mean((train[, "Sales"] - predictions)^2))
The root mean squared error of this machine learning model using XGBoost is `rmse_xgboost`, which is much higher than that of the linear regression model (1.7406663 × 10^-19). Therefore, I will use the multiple linear regression model above to make the final prediction.
As noted above, the multiple linear regression model has been shown to perform the best compared with the Poisson regression and XGBoost models, so it is used to generate the December forecast.
linear_reg_predict <-
predict.lm(linear_reg_fit_final,
newdata = dec_ad_final,
type = "response",
se.fit = TRUE)
## Warning in predict.lm(linear_reg_fit_final, newdata = dec_ad_final, type =
## "response", : prediction from a rank-deficient fit may be misleading
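The rank-deficiency warning indicates that some predictors are linear combinations of others, so their coefficients could not be estimated. A quick diagnostic sketch, assuming the linear_reg_fit_final object above:

```r
# coefficients R could not estimate appear as NA in the fitted model
names(coef(linear_reg_fit_final))[is.na(coef(linear_reg_fit_final))]
# alias() reports exactly which terms are linearly dependent on the others
alias(linear_reg_fit_final)
```

Dropping the aliased terms (or collapsing redundant dummy variables) would remove the warning without changing the fitted values.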
dec_ad_final$Sales <- round(linear_reg_predict$fit)
dec_ad_final %>%
select(Date, Sales) %>%
knitr::kable()

| Date | Sales |
|---|---|
| 2020-12-01 | 108 |
| 2020-12-02 | 105 |
| 2020-12-03 | 100 |
| 2020-12-04 | 105 |
| 2020-12-05 | 124 |
| 2020-12-06 | 94 |
| 2020-12-07 | 111 |
| 2020-12-08 | 106 |
| 2020-12-09 | 116 |
| 2020-12-10 | 95 |
| 2020-12-11 | 99 |
| 2020-12-12 | 112 |
| 2020-12-13 | 123 |
| 2020-12-14 | 102 |
| 2020-12-15 | 100 |
| 2020-12-16 | 109 |
| 2020-12-17 | 105 |
| 2020-12-18 | 98 |
| 2020-12-19 | 106 |
| 2020-12-20 | 121 |
| 2020-12-21 | 98 |
| 2020-12-22 | 126 |
| 2020-12-23 | 103 |
| 2020-12-24 | 116 |
| 2020-12-25 | 94 |
| 2020-12-26 | 113 |
| 2020-12-27 | 95 |
| 2020-12-28 | 94 |
| 2020-12-29 | 135 |
| 2020-12-30 | 111 |
| 2020-12-31 | 107 |
The linear regression model predicts total unit sales of 3331 in December 2020, with a 95% confidence interval of 3231 to 3430. Thus, the model suggests that the sales target of 3900 units for December 2020 is unlikely to be met.
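One way to construct such a monthly interval is to sum the daily point forecasts and combine their standard errors; a minimal sketch, assuming the linear_reg_predict object above and independent daily prediction errors (the interval in the text may have been computed slightly differently):

```r
# total December forecast: the sum of the 31 daily fitted values
total_fit <- sum(linear_reg_predict$fit)
# under independence, the variance of a sum is the sum of the variances
total_se <- sqrt(sum(linear_reg_predict$se.fit^2))
# approximate 95% confidence interval for the monthly total
round(total_fit + c(-1.96, 1.96) * total_se)
```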
The plot below is an interactive plot containing both the historical (green) and forecast (red) sales with 95% confidence intervals. Hover the mouse cursor over the plot to see the underlying estimates.
market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecast"
dec_ad_final$Sales_max <- with(linear_reg_predict, fit + 1.96*se.fit) |> round()
dec_ad_final$Sales_min <- with(linear_reg_predict, fit - 1.96*se.fit) |> round()
market_sale_final$Sales_max <-
market_sale_final$Sales_min <-
market_sale_final$Sales
g <-
market_sale_final %>%
bind_rows(dec_ad_final) %>%
mutate(Type = factor(Type)) %>%
ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
geom_path() +
geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()
plotly::ggplotly(g) %>% plotly::hide_legend()
In conclusion, multiple linear regression appears to perform the best here. The model explains 94.51% of the variation in total units sold and should therefore provide accurate predictions.
An alternative approach is a time-series analysis with the forecast R package, which can better handle autocorrelation, seasonality, and other temporal patterns in the data. However, the forecast package requires the dataset to be converted to a time-series (ts) object. This conversion can be complicated, especially if we need to include other predictors, such as “Total Advertising Spend”, in the model for inference and prediction.
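As a rough illustration of that alternative, auto.arima() from the forecast package accepts external regressors through its xreg argument; a minimal sketch, assuming daily data in market_sale_final and dec_ad_final with an Ad_Spend column (column names are illustrative):

```r
library(forecast)

# convert daily sales to a ts object with weekly (7-day) seasonality
sales_ts <- ts(market_sale_final$Sales, frequency = 7)
# fit an ARIMA model with advertising spend as an external regressor
fit <- auto.arima(sales_ts, xreg = as.matrix(market_sale_final[, "Ad_Spend"]))
# forecast December 2020 using the planned advertising spend for that month
fc <- forecast(fit, xreg = as.matrix(dec_ad_final[, "Ad_Spend"]))
autoplot(fc)
```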